docs: Eradicate unpaged SELECTs and warn about their drawbacks #1068

wprzytula · 2024-08-27T15:04:30Z

Follow-up of #1061.

Motivation

Unpaged SELECTs should be discouraged due to the following drawbacks:

However, query_unpaged will return all results in one, possibly giant, piece
(unless a timeout occurs due to high load incurred by the cluster).
This:

- increases latency,
- has a large memory footprint,
- puts high load on the cluster,
- is more likely to time out (because big work takes more time than little work,
  and returning one large piece of data is more work than returning one chunk of data).
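
The memory-footprint drawback listed above can be illustrated with a minimal sketch. It is a toy model only, not the driver's API: `fetch_unpaged`, `fetch_page` and `sum_paged` are hypothetical helpers, rows are plain integers, and the "cluster" is an in-memory range. It assumes a non-zero page size.

```rust
// Toy model, NOT the scylla driver API: all names below are hypothetical.

/// Hypothetical unpaged fetch: materialises the whole result set at once.
fn fetch_unpaged(total_rows: usize) -> Vec<u64> {
    (0..total_rows as u64).collect()
}

/// Hypothetical paged fetch: returns at most `page_size` rows starting at `offset`.
fn fetch_page(total_rows: usize, offset: usize, page_size: usize) -> Vec<u64> {
    let end = (offset + page_size).min(total_rows);
    ((offset as u64)..(end as u64)).collect()
}

/// Sums all rows page by page.
/// Returns the sum and the largest number of rows held in memory at once.
fn sum_paged(total_rows: usize, page_size: usize) -> (u64, usize) {
    let mut sum: u64 = 0;
    let mut peak = 0;
    let mut offset = 0;
    loop {
        let page = fetch_page(total_rows, offset, page_size);
        if page.is_empty() {
            break; // no more rows
        }
        peak = peak.max(page.len());
        sum += page.iter().sum::<u64>();
        offset += page.len();
        // `page` is dropped here, before the next one is fetched.
    }
    (sum, peak)
}

fn main() {
    let total = 10_000;

    let all = fetch_unpaged(total);
    let unpaged_sum: u64 = all.iter().sum();
    println!("unpaged: {} rows held at once, sum = {}", all.len(), unpaged_sum);

    let (paged_sum, peak) = sum_paged(total, 500);
    println!("paged: at most {} rows held at once, sum = {}", peak, paged_sum);
}
```

Both variants compute the same sum, but the paged loop never holds more than one page of rows, while the unpaged call holds the entire result set.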

What's done

- the above caveat is added in the documentation book,
- existing examples are modified to use query_iter instead,
- giant bonus: a broad overview of CQL statement kinds is added, together with best practices wrt preparing, batching and paging.

Note to reviewers

Please make sure (by reading both crate docs and the docs book) that in all places it is clear enough that SELECTs by default should be done using query_iter rather than query_unpaged, and why so.

Pre-review checklist

- I have split my patch into logically separate commits.
- All commit messages clearly explain what they change and why.
- ~~[ ] I added relevant tests for new features and bug fixes.~~
- All commits compile, pass static checks and pass tests.
- PR description sums up the changes and reasons why they should be introduced.
- ~~[ ] I have provided docstrings for the public items that I want to introduce.~~
- I have adjusted the documentation in ./docs/source/.
- ~~[ ] I added appropriate Fixes: annotations to PR description.~~

github-actions · 2024-08-27T15:09:36Z

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: 8604a33

wprzytula · 2024-08-27T19:15:09Z

NOTE: examples in examples/*.rs are still not adjusted this way. I suggest a separate PR for them.

fee-mendes

In general it LGTM.

One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.

Either way, I guess it is fine to follow-up somewhere else and it shouldn't block the existing work.

muzarski · 2024-08-28T10:21:44Z

In docs/execution-profiles/priority.md:

let mut query = Query::from("SELECT * FROM ks.table");

// Query is not assigned any specific profile, so session's profile is applied.
// Therefore, the query will be executed with Consistency::One.
session.query_unpaged(query.clone(), ()).await?;

query.set_execution_profile_handle(Some(query_profile.into_handle()));
// Query's profile is applied.
// Therefore, the query will be executed with Consistency::Two.
session.query_unpaged(query.clone(), ()).await?;

query.set_consistency(Consistency::Three);
// An option is set directly on the query.
// Therefore, the query will be executed with Consistency::Three.
session.query_unpaged(query, ()).await?;

I'd use Session::query_single_page here.

muzarski

But otherwise, LGTM

wprzytula · 2024-08-28T11:32:16Z

In docs/execution-profiles/priority.md:

> I'd use Session::query_single_page here.

Wouldn't query_single_page (compared to query_iter) introduce more noise, as it involves manually passing the paging state?

wprzytula · 2024-08-28T11:35:26Z

A general thought:

Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.

However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved.
WDYT @fee-mendes @Lorak-mmk @muzarski ?

fee-mendes · 2024-08-28T11:37:25Z

I think it is fine. As long as we clarify the Performance section in paged.md as I pointed out earlier and discuss these points in depth there.

Probably that's how you want to introduce the paged queries doc

wprzytula · 2024-08-28T11:41:38Z

> One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.

What disclaimer do you have in mind? Everything I can see seems to be up to date.

muzarski · 2024-08-28T11:46:19Z

In docs/execution-profiles/priority.md:

> > I'd use Session::query_single_page here.
>
> Wouldn't query_single_page (compared to query_iter) introduce more noise, as it involves manually passing the paging state?

That's true. My only concern was that we use query_unpaged for a SELECT * statement. But this doc wasn't actually created to showcase the query_unpaged/query_iter differences (the result is not even used here), so I think it's fine to leave it as query_unpaged.

muzarski · 2024-08-28T11:47:00Z

> A general thought:
>
> Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.
>
> However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved. WDYT @fee-mendes @Lorak-mmk @muzarski ?

In examples refactor (that I'm currently working on), I decided to make use of query_single_page when we are sure that result will consist of a single row (or when we want to showcase the QueryResult API).

wprzytula · 2024-08-28T12:12:36Z

> > A general thought:
> >
> > Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.
> >
> > However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved. WDYT @fee-mendes @Lorak-mmk @muzarski ?
>
> In examples refactor (that I'm currently working on), I decided to make use of query_single_page when we are sure that result will consist of a single row (or when we want to showcase the QueryResult API).

Remember that even if the result is going to consist of a single row, in some situations (e.g. when there are a lot of tombstones - @fee-mendes please confirm) there can be a number of empty pages received before we finally get one with our expected row.

fee-mendes · 2024-08-28T12:47:10Z

> Remember that even if the result is going to consist of a single row, in some situations (e.g. when there are a lot of tombstones - @fee-mendes please confirm) there can be a number of empty pages received before we finally get one with our expected row.

Correct. If the query is paged, empty replica pages will kick in. If it is unpaged, then you retrieve the full result set.

In general, it is hard to guarantee a SELECT will return only a single row, or a single page. LIMIT 1, aggregations (COUNT()), and key-value tables are probably the only exceptions.
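
The empty-pages effect described above can be sketched with a toy model. This is not the driver's API: `Page`, `fetch_page`, `first_page_only` and `all_pages` are hypothetical names, and the "server" is a slice of pre-built pages in which the leading pages are empty (as might happen when only tombstones were scanned).

```rust
// Toy model, NOT the scylla driver API: all names below are hypothetical.

/// One page of a hypothetical paged response.
struct Page {
    rows: Vec<&'static str>,
    /// Opaque cursor for the next page; `None` means the result set is exhausted.
    next_state: Option<usize>,
}

/// Hypothetical server call: returns the page selected by `state`
/// (`None` means "start from the beginning").
fn fetch_page(pages: &[Vec<&'static str>], state: Option<usize>) -> Page {
    let idx = state.unwrap_or(0);
    let rows = pages.get(idx).cloned().unwrap_or_default();
    let next_state = if idx + 1 < pages.len() { Some(idx + 1) } else { None };
    Page { rows, next_state }
}

/// Reads only the first page and ignores the paging state.
fn first_page_only(pages: &[Vec<&'static str>]) -> Vec<&'static str> {
    fetch_page(pages, None).rows
}

/// Follows the paging state until the result set is exhausted.
/// Returns all rows and the number of pages that had to be fetched.
fn all_pages(pages: &[Vec<&'static str>]) -> (Vec<&'static str>, usize) {
    let mut rows = Vec::new();
    let mut fetched = 0;
    let mut state = None;
    loop {
        let page = fetch_page(pages, state);
        fetched += 1;
        rows.extend(page.rows);
        match page.next_state {
            Some(next) => state = Some(next),
            None => break,
        }
    }
    (rows, fetched)
}

fn main() {
    // Two empty pages precede the page with the single live row.
    let pages = vec![vec![], vec![], vec!["the-only-row"]];

    println!("first page only: {:?}", first_page_only(&pages));
    let (rows, fetched) = all_pages(&pages);
    println!("after following paging state ({} pages): {:?}", fetched, rows);
}
```

In this model, a caller that reads a single page and discards the paging state sees no rows at all, even though the result set contains one row. Following the paging state to the end is what makes the result correct.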

fee-mendes · 2024-08-28T13:05:40Z

> > One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.
>
> What disclaimer do you have in mind? Everything I can see seems to be up to date.

I am reading paged.md which says:

### Performance
Performance is the same as in non-paged variants.\
For the best performance use [prepared queries](prepared.md).

Well, this doesn't seem to be accurate. In fact, it was never accurate.

- It doesn't specify which is performant (query_single_page, *_iter, or both?).
- It doesn't say under which circumstances unpaged or query_single_page are more performant than *_iter.
- It doesn't explain why or when one should choose query_single_page vs *_iter.
- It leads one to believe that they can use non-paged queries indiscriminately, as we say "performance is the same", which is never true.

Maybe rather than talking about Performance, we should just recommend what's better instead. Always prefer paged queries, from the driver's perspective use X as it performs better due to A, B, C. We also provide Y, which is slightly less performant on the client-side given D, E, F.

// some warning disclaimer on unpaged //

wprzytula · 2024-08-28T13:12:36Z

> I am reading paged.md which says:
>
> ### Performance
> Performance is the same as in non-paged variants.\
> For the best performance use [prepared queries](prepared.md).

Right. This particular statement was never true wrt {query,execute}_iter, which are always at least a bit less performant than {query,execute}_{single_page,unpaged}. I'll correct those recommendations as you suggest.

fee-mendes · 2024-08-28T13:19:44Z

> Right. This particular statement was never true wrt {query,execute}_iter, which are always at least a bit less performant than {query,execute}_{single_page,unpaged}. I'll correct those recommendations as you suggest.

Well, be mindful though that performance is subjective even in the {query,execute}_{single_page,unpaged} case. I'd argue that single_page is always more performant than unpaged, as it has a smaller memory footprint, given that unpaged queries may return a large QueryResult when scanning large keys, not to mention the pressure on the server side.

Hence why I think the focus should shift from "Performance" to our recommended "Best Practices", with both client/server in mind.

Lorak-mmk · 2024-08-29T10:51:52Z

docs/source/queries/result.md

+
+> To sum up, for SELECTs that may return a lot of data prefer paged queries,
+> e.g. with `Session::query_iter()` (see [Paged queries](paged.md)).
+


Maybe remove a remark about a lot of data because of the tombstone problem?

Or maybe rephrase it as (especially those that may return a lot of data)?

wprzytula · 2024-08-29T11:11:25Z

New commit addressing @fee-mendes' comments: giant overview of statements and methods of executing them, together with best practices.

wprzytula · 2024-08-29T11:14:49Z

Addressed Karol's comments.

Lorak-mmk · 2024-08-29T11:47:30Z

docs/source/queries/paged.md

+| Cluster load            | potentially **HIGH** for large results, beware!                                                                         | normal                                                                                               | normal                                                                                            |
+| Driver overhead         | low - simple frame fetch                                                                                                | low - simple frame fetch                                                                             | considerable - `RowIteratorWorker` is a separate tokio task                                       |
+| Feature limitations     | none                                                                                                                    | none                                                                                                 | speculative execution not supported                                                               |


I'm wondering if we should print a warning / error if speculative execution is set on the execution profile used in {query,execute}_iter.

Hmm, that might be a good idea, but OTOH it would force users to use different profiles for iter and non-iter execution, which would be inconvenient (for speculative execution users).

I worry about a situation where the user wants to use speculative execution but silently doesn't get it because of the _iter methods. You raised a good point; I'm not sure how best to handle this possible issue.

Lorak-mmk · 2024-08-29T11:47:54Z

docs/source/queries/paged.md

Isn't a SELECT with LIMIT still susceptible to the tombstone problem?

@fee-mendes ?

Lorak-mmk · 2024-08-29T12:01:01Z

docs/source/queries/queries.md

+| Suitable operations   | - in general: operations with empty result set (non-SELECTs)</br> - as possible optimisation: SELECTs with LIMIT clause | - in general: all SELECTs                                                                                                                                            |
+
+For more detailed comparison and more best practices, see [doc page about paging](paged.md).


ditto about LIMIT

wprzytula · 2024-08-29T13:20:38Z

Addressed another round of @Lorak-mmk and @muzarski comments.

wprzytula added 4 commits August 27, 2024 16:59

docs: rephrase paging purpose in a leftover spot

17316cd

In paged.md such formulation was already rephrased in PR with paging API refactor.

docs/data_types: use query_iter for SELECTs

dc01d0d

We shouldn't encourage users to perform unpaged SELECTs. Normally, SELECTs should be paged, therefore the whole doc series on CQL data types is modified to use query_iter instead of query_unpaged.

docs: use query_iter in SELECT example in README

5975fa9

We shouldn't encourage users to perform unpaged SELECTs. Normally, SELECTs should be paged, therefore the example in README is modified to use query_iter instead of query_unpaged.

docs: use query_iter for SELECT in quickstart example

112ed5d

We shouldn't encourage users to perform unpaged SELECTs. Normally, SELECTs should be paged, therefore the quickstart example is modified to use query_iter instead of query_unpaged.

wprzytula requested a review from fee-mendes August 27, 2024 15:04

wprzytula self-assigned this Aug 27, 2024

wprzytula requested review from muzarski and Lorak-mmk August 27, 2024 15:04

wprzytula added area/documentation Improvements or additions to documentation area/statement-execution labels Aug 27, 2024

wprzytula added this to the 0.14.0 milestone Aug 27, 2024

fee-mendes approved these changes Aug 27, 2024

muzarski approved these changes Aug 28, 2024

muzarski mentioned this pull request Aug 28, 2024

examples: replace query_unpaged for SELECT queries #1069

Merged

4 tasks

Lorak-mmk reviewed Aug 29, 2024

docs/source/queries/result.md Outdated

Lorak-mmk reviewed Aug 29, 2024

docs/source/queries/paged.md

wprzytula force-pushed the eradicate-unpaged-selects-from-docs branch from 349bc79 to 5b39d8f Compare August 29, 2024 11:14

wprzytula requested review from Lorak-mmk, fee-mendes and muzarski August 29, 2024 11:14

Lorak-mmk reviewed Aug 29, 2024

docs/source/queries/simple.md Outdated

Lorak-mmk reviewed Aug 29, 2024

muzarski reviewed Aug 29, 2024

docs/source/queries/queries.md

wprzytula added 3 commits August 29, 2024 15:10

docs: add caveat about unpaged SELECTs' drawbacks

8e4c555

We shouldn't encourage users to perform unpaged SELECTs. Normally, SELECTs should be paged, therefore caveat about that is added in two relevant places in docs.

docs/paged: remove leftover PageSize try_into()

663e09d

This wasn't reverted when design regarding PageSize changed during code review.

docs: exhaustive overview of statements & best practices

8604a33

In order to avoid API misuse, much knowledge is now shared in a structured way of tables, and best practices are described to aid users.

wprzytula force-pushed the eradicate-unpaged-selects-from-docs branch from 5b39d8f to 8604a33 Compare August 29, 2024 13:20

wprzytula requested review from Lorak-mmk and muzarski August 29, 2024 13:20

muzarski approved these changes Aug 29, 2024

Lorak-mmk approved these changes Aug 29, 2024

fee-mendes approved these changes Sep 2, 2024

dkropachev merged commit edc2ae2 into scylladb:main Sep 4, 2024
12 checks passed

wprzytula deleted the eradicate-unpaged-selects-from-docs branch September 4, 2024 20:12
